HKIB-20000 & HKIB-40075: Hangul Benchmark Collections for Text Categorization Research


Similar resources

HKIB-20000 & HKIB-40075: Hangul Benchmark Collections for Text Categorization Research

The HKIB, or Hankookilbo, test collections are two archives of Korean newswire stories manually categorized with semi-hierarchical or hierarchical category taxonomies. The base newswire stories were made available by the Hankook Ilbo (The Korea Daily) for research purposes. At first, Chungnam National University and KISTI collaborated to manually tag 40,075 news stories with categories by semi-...

RCV1: A New Benchmark Collection for Text Categorization Research

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, ...

Categorization of Large Text Collections: Feature Selection for Training Neural Networks

Automatic text categorization requires the construction of appropriate surrogates for documents within a text collection. The surrogates, often called document vectors, are used to train learning systems for categorising unseen documents. A comparison of different measures (tfidf and weirdness) for creating document vectors is presented together with two different state-of-the-art classifiers: s...

Neural Text Categorizer for Exclusive Text Categorization

This research proposes a new neural network for text categorization which uses alternative representations of documents to numerical vectors. Since the proposed neural network is intended originally only for text categorization, it is called NTC (Neural Text Categorizer) in this research. Numerical vectors representing documents for tasks of text mining have inherently two main problems: huge d...

Text Representation for Automatic Text Categorization

Automatic Text Categorization (ATC), the automatic assignment of text documents to predefined classes, is a language engineering task very relevant to a number of applications, including automatic content and knowledge management in corporations and the Internet, information access and filtering, etc. With the first works dating back to the 1960s [14], and increased work in the last decade (see the surv...

Journal

Journal title: Journal of Computing Science and Engineering

Year: 2009

ISSN: 1976-4677

DOI: 10.5626/jcse.2009.3.3.165